Finite Versus Infinite Neural Networks: an Empirical Study
We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite-width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite-width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.
Review for NeurIPS paper: Finite Versus Infinite Neural Networks: an Empirical Study
Correctness: Given the empirical nature of the paper, it's hard to directly evaluate its correctness. In terms of the empirical methodology, for the most part I think it was superb (as I discussed in the "strengths" section above). One thing I'm curious about relating to the paper's results is the well-known fact that infinite-width networks cannot learn any representations (i.e., the kernels don't depend on the data). On the other hand, a common hypothesis about why neural networks are so powerful is that they are really great at learning useful representations. Given that, there seems to be some tension with the result that FCNs at finite width underperform their kernel counterparts. I wonder if the problems studied were too simple to require the learning of useful representations (or whether the FCNs were too shallow to learn them, given that they only had 3 layers).
Review for NeurIPS paper: Finite Versus Infinite Neural Networks: an Empirical Study
This paper conducts thorough experiments comparing the performance of finite-width and infinite-width networks. The architectures investigated are FCNs and CNNs with/without global average pooling (GAP), and two parameterizations of each (standard and NTK) are compared. Several techniques, such as regularization and ensemble learning, are applied to these methods. Through experiments under several different settings, the authors derive conclusions from multiple viewpoints. They also develop best practices for using non-trainable kernels on the CIFAR-10 classification task.